Abstract





Big Data comprises both structured and unstructured data collected from various sources.
Collecting, managing, storing and analyzing such large datasets requires an efficient tool. Hadoop is an
open-source framework for processing large datasets, and MapReduce in Hadoop is an effective
programming model that reduces the computation time of large-scale databases in a distributed architecture.
Machine learning and deep learning algorithms implemented on MapReduce can reduce the processing time
for huge datasets. This paper studies various MapReduce-based models and algorithms for analyzing
huge data. It also considers how algorithms can be implemented in MapReduce to reduce computing time.
1. Introduction

Data is being generated in all major sectors,
including healthcare, e-commerce, social media,
banking and finance, at scales ranging from
petabytes to exabytes. Processing such huge
datasets with a sequential program increases
processing time, whereas processing big data in a
distributed architecture with the MapReduce
programming model reduces processing time as
the number of data nodes increases. Efficient
processing can be discovered and automated by
combining machine learning and deep learning
with the MapReduce framework. Typically, the
MapReduce technique in Hadoop processes
large-scale datasets in parallel. Implementing a
MapReduce-based machine/deep learning
algorithm with an increased number of nodes
would improve efficiency and reduce processing
time. Many authors have proposed
MapReduce-based models, systems and
approaches to analyze big data effectively. In this
paper, those approaches are analyzed and
compared [1-5].
1.1 Hadoop

Hadoop is an open-source framework from Apache
for storing and processing big data in a
distributed environment. Hadoop's architecture
has two main components:

Distributed File System: 

The Hadoop Distributed File System (HDFS) is
designed to store and process large datasets and
runs on commodity hardware. HDFS is similar to
other existing distributed file systems, but its
most significant features are that it is highly
fault-tolerant and can be deployed on low-cost
hardware. In addition, HDFS follows a
master/slave architecture in which metadata is
stored in the NameNode, which acts as the master
server, and application data is stored in
DataNodes, which act as slave servers. For
reliability, the file content is replicated across
DataNodes.
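
As a rough illustration of the NameNode/DataNode split described above, the following toy Python sketch keeps block metadata in one object and the replicated block contents in plain dictionaries. The class and function names, the 4-byte block size and the round-robin placement are assumptions made for this example only; real HDFS uses much larger blocks and a rack-aware placement policy.

```python
from collections import defaultdict


class ToyNameNode:
    """Holds only metadata: which DataNodes store each block of each file."""

    def __init__(self, datanode_ids, replication=3):
        self.datanode_ids = list(datanode_ids)
        self.replication = min(replication, len(self.datanode_ids))
        self.block_map = defaultdict(list)  # filename -> [(block_id, [node ids])]
        self._next = 0  # round-robin cursor (illustrative placement only)

    def allocate(self, filename, block_id):
        """Choose the DataNodes that will hold the replicas of one block."""
        n = len(self.datanode_ids)
        targets = [self.datanode_ids[(self._next + i) % n]
                   for i in range(self.replication)]
        self._next = (self._next + 1) % n
        self.block_map[filename].append((block_id, targets))
        return targets


def write_file(namenode, datanodes, filename, data, block_size=4):
    """Split data into blocks and copy each block to its replica DataNodes."""
    for offset in range(0, len(data), block_size):
        block_id = f"{filename}#{offset // block_size}"
        for node_id in namenode.allocate(filename, block_id):
            datanodes[node_id][block_id] = data[offset:offset + block_size]


def read_file(namenode, datanodes, filename, failed=()):
    """Reassemble a file, skipping DataNodes listed as failed."""
    parts = []
    for block_id, targets in namenode.block_map[filename]:
        alive = [t for t in targets if t not in failed]
        if not alive:
            raise IOError(f"all replicas of {block_id} are lost")
        parts.append(datanodes[alive[0]][block_id])
    return b"".join(parts)


datanodes = {"dn1": {}, "dn2": {}, "dn3": {}, "dn4": {}}
namenode = ToyNameNode(datanodes, replication=3)
write_file(namenode, datanodes, "log.txt", b"hello hadoop")
# The file is still readable after two of the four DataNodes fail.
recovered = read_file(namenode, datanodes, "log.txt", failed={"dn1", "dn2"})
```

With three replicas per block, the file survives the loss of any two DataNodes in this toy setup, which is the reliability property that replication is meant to provide.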

MapReduce: 

MapReduce is a programming model for processing
huge datasets in parallel in a distributed
environment. The MapReduce algorithm was
proposed by Google to increase speed by
processing distributed big data on a cloud
platform. The major phases involved in
MapReduce are:

i) Split: This phase splits the input into a fixed
number of pieces to be evaluated in the map phase.

ii) Map: In the map phase, the data from each data
block is split and key-value pairs are generated
for the data.

iii) Shuffle: In this phase, the key-value pairs
generated by the map function are passed as
input, and pairs with the same key are grouped
together.

iv) Reduce: This phase takes the output of the
shuffle phase and aggregates it, so that the values
for each key are reduced to a single output value.
Fig.1 depicts the architecture of the MapReduce
framework.





Fig.1. MapReduce Architecture 
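
The four phases listed above can be sketched in plain Python with a word-count example. This is a minimal single-process illustration, not Hadoop code: the function names and the sample input are assumptions made for this sketch, and on a cluster the map calls would run in parallel on different nodes.

```python
from collections import defaultdict


def split_input(lines, num_splits):
    """Split: divide the input into a fixed number of pieces."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_phase(chunk):
    """Map: emit a (word, 1) key-value pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle_phase(mapped_chunks):
    """Shuffle: group the values of all pairs that share the same key."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the values of each key into a single output value."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["big data needs big tools", "hadoop processes big data"]
splits = split_input(lines, num_splits=2)
mapped = [map_phase(chunk) for chunk in splits]  # parallel on a real cluster
word_counts = reduce_phase(shuffle_phase(mapped))
```

Here `word_counts["big"]` is 3 and `word_counts["data"]` is 2, since the shuffle step gathers the pairs emitted by both map calls before the reduce step sums them.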


2. Literature Review 

Nasullah Khalid Alham, Maozhen Li, Yang Liu
and Suhel Hammoud [3] proposed automatic image
annotation using a MapReduce-based SVM
(MRSMO). MRSMO splits the large dataset into
smaller subsets, and each subset is allocated to a
map task. The map function in each task optimizes
its subset in parallel. The output of the map tasks
differs depending on linearity. For a linear SVM,
the partial weight vectors from the map tasks are
passed to the reduce task to obtain the global
weight vector; for a non-linear SVM, the alpha
arrays are passed to the reduce task, which
produces the global alpha array.

Anan Banharnsakun [4]
recommended a MapReduce-incorporated artificial
bee colony (MR-ABC) algorithm for clustering. It
aims to minimize the sum of squared Euclidean
distances between the data records and the cluster
centroids. The map function retrieves the clusters'
centroids from the ABC, which are stored in
HDFS. The centroid values extracted from each
bee are used to calculate the distance between the
centroids and each data record in order to obtain
the minimum distance. The reduce function groups
the values with the same key obtained from the
map function to determine the average distance,
which is returned as the fitness value.

Daniel Valcarce, Javier Parapar and Alvaro
Barreiro [6-8] proposed a MapReduce-based
recommender system that implements the
Posterior Probability Clustering algorithm, which
is based on matrix factorization, followed by
Relevance Models. To reduce complexity, the
algorithm is implemented in the (distributed)
MapReduce framework so that the recommender
can process huge datasets. Furthermore, two join
strategies, replication and broadcast, are used to
make the process efficient.

Weizhong Zhao,
Huifang Ma and Qing He [9-11] recommended
MapReduce-based PKMeans clustering to analyze
huge datasets effectively. The map function
assigns each sample to its closest center, and the
reduce function aggregates the samples assigned
to each center, determines their total number and
updates the center accordingly.

Shiva Asadianfam, Mahboubeh Shamsi and
Abdolreza Rasouli Kenari [6] proposed the
TVD-MRDL algorithm to automate the detection
of driver violations using the MapReduce
technique. The analysis includes both structured
and unstructured data. The proposed system is
able to analyze the traffic control center's data and
the descriptions predefined by the police. To
process the images, a deep learning algorithm, the
convolutional neural network (CNN), is used.
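
The map and reduce roles described for PKMeans in this review (assign each sample to its closest center, then aggregate the samples of each center and update it) can be sketched as one iteration in plain Python. This is an illustrative sketch only, not the cited authors' implementation: the function names, the toy two-dimensional samples and the use of squared Euclidean distance are assumptions, and the grouping dictionary merely stands in for the shuffle step.

```python
from collections import defaultdict


def kmeans_map(sample, centers):
    """Map: emit (index of the closest center, sample)."""
    distances = [sum((s - c) ** 2 for s, c in zip(sample, center))
                 for center in centers]
    return distances.index(min(distances)), sample


def kmeans_reduce(samples):
    """Reduce: count the samples of one cluster and return their mean."""
    count = len(samples)
    dims = len(samples[0])
    return tuple(sum(s[d] for s in samples) / count for d in range(dims))


def kmeans_iteration(samples, centers):
    """Run one map/shuffle/reduce round and return the updated centers."""
    grouped = defaultdict(list)  # stands in for the shuffle step
    for sample in samples:
        key, value = kmeans_map(sample, centers)
        grouped[key].append(value)
    # A center that attracted no samples keeps its previous position.
    return [kmeans_reduce(grouped[i]) if grouped[i] else centers[i]
            for i in range(len(centers))]


samples = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
new_centers = kmeans_iteration(samples, centers)
```

In this toy run the two centers move to (1.0, 2.0) and (10.0, 9.0). A full clustering job would repeat the iteration until the centers stop changing.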

Mininath Bendre and Ramchandra Manthalkar
performed a case study to predict the behavior
patterns of UCAM students using the Azure
HDInsight big data solution and its HDFS
implementation. The association rules for the
events performed by the students were obtained by
applying the Apriori algorithm, which was further
incorporated into the MapReduce framework.

Neha Verma, Dheeraj Malhotra & Jatinder
Singh [8] presented a novel approach that uses
association mining for market basket analysis to
learn customers' expectations of a retail store.
Customers' buying patterns are analyzed using a
MapReduce-based Apriori algorithm implemented
using IRM tool.
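
One common way to express the support-counting step of Apriori in MapReduce is for each map task to emit the itemsets found in its share of the transactions and for the reduce task to sum the counts and keep the itemsets that reach a minimum support. The sketch below illustrates that step only, under stated assumptions: the function names, the toy baskets and the threshold are hypothetical, candidate pruning between passes is omitted, and nothing here is taken from the cited implementations.

```python
from collections import Counter
from itertools import combinations


def map_itemsets(transactions, size):
    """Map: emit (itemset, 1) for every itemset of the given size in a split."""
    for basket in transactions:
        for itemset in combinations(sorted(set(basket)), size):
            yield itemset, 1


def reduce_support(pairs, min_support):
    """Reduce: sum the counts per itemset and keep the frequent ones."""
    counts = Counter()
    for itemset, one in pairs:
        counts[itemset] += one
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Two input splits, each of which would go to a separate map task.
splits = [
    [["bread", "milk"], ["bread", "butter", "milk"]],
    [["milk", "butter"], ["bread", "milk", "eggs"]],
]
pairs = [kv for split in splits for kv in map_itemsets(split, size=2)]
frequent_pairs = reduce_support(pairs, min_support=2)
```

For these four baskets the frequent pairs are ("bread", "milk") with support 3 and ("butter", "milk") with support 2. Association rules would then be derived from such frequent itemsets.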

Ms. Vandana Vijay and Dr. Ruchi Nanda [9]
proposed the MRC-COVID system to store and
process the Covid-19 dataset. An in-memory cache
can be implemented on MapReduce to reduce
superfluous disk I/O operations at runtime.
Adding a cache to MapReduce improves
performance and reduces the data workload.
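
The general idea of avoiding repeated disk reads with an in-memory cache can be sketched as follows. This is a generic illustration and not the MRC-COVID design: the `CachedBlockReader` class, the block identifiers and the dictionary standing in for the disk are all hypothetical.

```python
class CachedBlockReader:
    """Serves repeated block reads from memory instead of the (simulated) disk."""

    def __init__(self, read_from_disk):
        self._read_from_disk = read_from_disk
        self._cache = {}
        self.disk_reads = 0  # number of reads that actually reached the disk

    def read(self, block_id):
        if block_id not in self._cache:
            self.disk_reads += 1
            self._cache[block_id] = self._read_from_disk(block_id)
        return self._cache[block_id]


# Hypothetical stand-in for blocks stored on disk.
disk = {"block-0": ["a,1", "b,2"], "block-1": ["c,3"]}
reader = CachedBlockReader(disk.__getitem__)

# Two passes over the same blocks, e.g. two jobs over one dataset.
for _ in range(2):
    for block_id in ("block-0", "block-1"):
        records = reader.read(block_id)
```

After both passes `reader.disk_reads` is 2 rather than 4, because the second pass is served entirely from memory.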


The comparison of various algorithms incorporated in the MapReduce framework is shown in Table 1.

Table.1. A short review of various Machine and Deep learning algorithms implemented in MapReduce

| Authors | Algorithm | Data Source | Accuracy | Comparison | Objective |
|---|---|---|---|---|---|
| Nasullah Khalid Alham, Maozhen Li, Yang Liu and Suhel Hammoud | MRSMO | Unlabelled images (50 images); multiclass classification (5000 images) | 93% | - | To annotate the image automatically |
| Anan Banharnsakun | MR-ABC | 4 datasets (Iris, CMC, Wine, Vowel) | 90% | PKMeans and parallel KESO | To minimize the sum of squared Euclidean distance between the data instances and the cluster's centroid |
| Daniel Valcarce, Javier Parapar and Alvaro Barreiro | PPC+RM2 | Netflix | - | Pearson, SVD, NMF | Scalable and distributed recommender based on MapReduce |
| Weizhong Zhao, Huifang Ma and Qing He | Parallel K-Means based on MapReduce | Datasets of different sizes | - | - | To analyze large datasets effectively |
| Shiva Asadianfam, Mahboubeh Shamsi and Abdolreza Rasouli Kenari | TVD-MRDL | Images from the traffic control center | Efficiency increased by more than 75% | Hadoop in stand-alone mode and Hadoop with more data nodes | To detect driver violations and behavior changes |
| Mininath Bendre, Ramchandra Manthalkar | Student behavior analysis in LMS using Big Data Framework | 70 GB of information regarding the behavior of UCAM students | - | - | To predict the pattern of student behavior of UCAM students |
| Neha Verma, Dheeraj Malhotra & Jatinder Singh | MR-based Apriori algorithm | Database (generated using IRM tool) | Size up provides better result | Factors: Speed, Size and Scale | To identify the buying pattern of customers |
| Ms. Vandana Vijay, Dr. Ruchi Nanda | MRC-COVID | Covid-19 dataset | - | - | To store and process the Covid-19 dataset |
Conclusion:
MapReduce is an effective framework for processing
large datasets in parallel. Implementing machine
learning and deep learning algorithms in MapReduce
results in better performance. In this paper, the
various algorithms, systems and models proposed
by authors for big data analytics in the
MapReduce model were reviewed, and the
efficiency of the proposed algorithms was
discussed. Little attention has been paid to feature
engineering in these implementations. In future
work, an effective algorithm for processing
feature-engineered large datasets in MapReduce
will be proposed.